88 research outputs found
An Indictment of Bright Line Tests for Honest Services Mail Fraud
Sparse matrix-matrix multiplication (SpGEMM) is a computational primitive that is widely used in areas ranging from traditional numerical applications to recent big data analysis and machine learning. Although many SpGEMM algorithms have been proposed, hardware-specific optimizations for multi- and many-core processors are lacking, and a detailed analysis of their performance under various use cases and matrices is not available. We first identify and mitigate multiple bottlenecks with memory management and thread scheduling on Intel Xeon Phi (Knights Landing or KNL). Specifically targeting multi- and many-core processors, we develop a hash-table-based algorithm and optimize a heap-based shared-memory SpGEMM algorithm. We examine their performance together with other publicly available codes. Unlike the literature, our evaluation also includes use cases that are representative of real graph algorithms, such as multi-source breadth-first search or triangle counting. Our hash-table and heap-based algorithms show significant speedups over libraries in the majority of the cases, while different algorithms dominate the other scenarios with different matrix size, sparsity, compression factor, and operation type. We summarize the in-depth evaluation results and provide a recipe for choosing the best SpGEMM algorithm for a target scenario. A critical finding is that hash-table-based SpGEMM gets a significant performance boost if the nonzeros are not required to be sorted within each row of the output matrix.
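The idea of a hash-table accumulator can be illustrated with a minimal serial sketch. It is not the authors' implementation: the function name `spgemm_hash`, the CSR triple layout `(indptr, indices, data)`, and the `sort_rows` flag are assumptions made for illustration, and a Python `dict` stands in for the hash table. Each output row is accumulated in a hash table keyed by column index, and sorting the row afterwards is optional, which is the step the abstract identifies as costly.

```python
def spgemm_hash(a, b, sort_rows=False):
    """Row-wise C = A * B using a per-row hash accumulator.

    a and b are CSR triples (indptr, indices, data); the result is a CSR
    triple as well. This layout is an assumption made for the sketch.
    """
    a_ptr, a_idx, a_val = a
    b_ptr, b_idx, b_val = b
    c_ptr, c_idx, c_val = [0], [], []
    for i in range(len(a_ptr) - 1):
        acc = {}  # hash table: column index -> accumulated value
        for p in range(a_ptr[i], a_ptr[i + 1]):
            k, a_ik = a_idx[p], a_val[p]
            # Scale row k of B by A[i, k] and merge it into the accumulator.
            for q in range(b_ptr[k], b_ptr[k + 1]):
                j = b_idx[q]
                acc[j] = acc.get(j, 0.0) + a_ik * b_val[q]
        # Sorting column indices within the row is optional.
        cols = sorted(acc) if sort_rows else list(acc)
        c_idx.extend(cols)
        c_val.extend(acc[j] for j in cols)
        c_ptr.append(len(c_idx))
    return c_ptr, c_idx, c_val


# A = [[1, 0], [2, 3]] and B = [[0, 4], [5, 0]] in CSR form.
A = ([0, 1, 3], [0, 0, 1], [1.0, 2.0, 3.0])
B = ([0, 1, 2], [1, 0], [4.0, 5.0])
C_sorted = spgemm_hash(A, B, sort_rows=True)
C_unsorted = spgemm_hash(A, B, sort_rows=False)
```

With `sort_rows=False` the entries of each output row appear in hash-table insertion order, so consumers must not assume ascending column indices.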
Parallel Sparse Matrix-Matrix Multiplication and Indexing: Implementation and Experiments
Generalized sparse matrix-matrix multiplication (or SpGEMM) is a key
primitive for many high performance graph algorithms as well as for some linear
solvers, such as algebraic multigrid. Here we show that SpGEMM also yields
efficient algorithms for general sparse-matrix indexing in distributed memory,
provided that the underlying SpGEMM implementation is sufficiently flexible and
scalable. We demonstrate that our parallel SpGEMM methods, which use
two-dimensional block data distributions with serial hypersparse kernels, are
indeed highly flexible, scalable, and memory-efficient in the general case.
This algorithm is the first to yield increasing speedup on an unbounded number
of processors; our experiments show scaling up to thousands of processors in a
variety of test scenarios
High-Quality Shared-Memory Graph Partitioning
Partitioning graphs into blocks of roughly equal size such that few edges run
between blocks is a frequently needed operation in processing graphs. Recently,
the size, variety, and structural complexity of these networks have grown
dramatically. Unfortunately, previous approaches to parallel graph partitioning
have problems in this context since they often show a negative trade-off
between speed and quality. We present an approach to multi-level shared-memory
parallel graph partitioning that guarantees balanced solutions, shows high
speed-ups for a variety of large graphs and yields very good quality
independently of the number of cores used. For example, on 31 cores, our
algorithm partitions our largest test instance into 16 blocks, cutting fewer than
half as many edges as our main competitor when both algorithms are
given the same amount of time. Important ingredients include parallel label
propagation for both coarsening and improvement, parallel initial partitioning,
a simple yet effective approach to parallel localized local search, and fast
locality preserving hash tables
Parallel algorithms for finding connected components using linear algebra
Finding connected components is one of the most widely used operations on a graph. Optimal serial algorithms for the problem have been known for half a century, and many competing parallel algorithms have been proposed over the last several decades under various models of parallel computation. This paper presents a class of parallel connected-component algorithms designed using linear-algebraic primitives. These algorithms are based on a PRAM algorithm by Shiloach and Vishkin and can be designed using standard GraphBLAS operations. We demonstrate two algorithms of this class, one named LACC for Linear Algebraic Connected Components, and the other named FastSV, which can be regarded as a simplification of LACC. With the support of the highly scalable Combinatorial BLAS library, LACC and FastSV outperform the previous state-of-the-art algorithm by a factor of up to 12x for small to medium scale graphs. For large graphs with more than 50B edges, LACC and FastSV scale to 4K nodes (262K cores) of a Cray XC40 supercomputer and outperform previous algorithms by a significant margin. This remarkable performance is accomplished by (1) exploiting sparsity that was not present in the original PRAM algorithm formulation, (2) using high-performance primitives of Combinatorial BLAS, and (3) identifying hot spots and optimizing them away by exploiting algorithmic insights.
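The hooking and shortcutting steps that Shiloach-Vishkin-style algorithms build on can be sketched serially as follows. This is neither LACC nor FastSV, and it uses plain Python lists rather than GraphBLAS operations; the function name `connected_components` and the edge-list input are assumptions made for illustration. Each vertex carries a parent label, edges hook the larger label under the smaller one, and pointer jumping flattens the resulting trees.

```python
def connected_components(n, edges):
    """Label each vertex with the smallest vertex id in its component.

    A simplified serial sketch of hooking plus shortcutting; the names and
    input format are hypothetical.
    """
    parent = list(range(n))
    changed = True
    while changed:
        changed = False
        # Hooking: attach the larger of the two labels under the smaller one.
        for u, v in edges:
            pu, pv = parent[u], parent[v]
            if pu != pv:
                hi, lo = max(pu, pv), min(pu, pv)
                if parent[hi] > lo:
                    parent[hi] = lo
                    changed = True
        # Shortcutting (pointer jumping): make every vertex point at a root.
        for v in range(n):
            while parent[v] != parent[parent[v]]:
                parent[v] = parent[parent[v]]
    return parent


labels = connected_components(6, [(0, 1), (1, 2), (3, 4)])
chain = connected_components(5, [(4, 3), (3, 2), (2, 1), (1, 0)])
```

Labels only ever decrease, so the loop terminates; once no hook fires, both endpoints of every edge carry the same label.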
RDMA vs. RPC for implementing distributed data structures
Distributed data structures are key to implementing scalable applications for scientific simulations and data analysis. In this paper we look at two implementation styles for distributed data structures: remote direct memory access (RDMA) and remote procedure call (RPC). We focus on operations that require individual accesses to remote portions of a distributed data structure, e.g., accessing a hash table bucket or distributed queue, rather than global operations in which all processors collectively exchange information. We look at the trade-offs between the two styles through microbenchmarks and a performance model that approximates the cost of each. The RDMA operations have direct hardware support in the network and therefore lower latency and overhead, while the RPC operations are more expressive but higher cost and can suffer from lack of attentiveness from the remote side. We also run experiments to compare the real-world performance of RDMA- and RPC-based data structure operations with the predicted performance to evaluate the accuracy of our model, and show that while the model does not always precisely predict running time, it allows us to choose the best implementation in the examples shown. We believe this analysis will assist developers in designing data structures that will perform well on current network architectures, as well as network architects in providing better support for this class of distributed data structures
Implementing Push-Pull Efficiently in GraphBLAS
We factor Beamer's push-pull, also known as direction-optimized
breadth-first search (DOBFS), into three separable optimizations and analyze them
for generalizability, asymptotic speedup, and contribution to overall speedup.
We demonstrate that masking is critical for high performance and can be
generalized to all graph algorithms where the sparsity pattern of the output is
known a priori. We show that these graph algorithm optimizations, which
together constitute DOBFS, can be neatly and separably described using linear
algebra and can be expressed in the GraphBLAS linear-algebra-based framework.
We provide experimental evidence that with these optimizations, a DOBFS
expressed in a linear-algebra-based graph framework attains competitive
performance with state-of-the-art graph frameworks on the GPU and on a
multi-threaded CPU, achieving 101 GTEPS on a Scale 22 RMAT graph.
Comment: 11 pages, 7 figures, International Conference on Parallel Processing
(ICPP) 201
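The push and pull directions, and the role of the "unvisited" mask, can be illustrated with a small serial sketch. It does not use the GraphBLAS API and is not the paper's implementation: the function name `bfs_push_pull`, the adjacency-list input for an undirected graph, and the `pull_threshold` switching rule are assumptions made for illustration.

```python
def bfs_push_pull(adj, source, pull_threshold=0.25):
    """BFS depths on an undirected graph given as adjacency lists.

    Switches from push to pull when the frontier holds more than
    pull_threshold * n vertices (a hypothetical heuristic for this sketch).
    """
    n = len(adj)
    depth = [-1] * n  # -1 marks unvisited; this acts as the mask
    depth[source] = 0
    frontier = {source}
    level = 0
    while frontier:
        level += 1
        nxt = set()
        if len(frontier) > pull_threshold * n:
            # Pull: every unvisited vertex looks for a neighbor in the frontier.
            for v in range(n):
                if depth[v] == -1 and any(u in frontier for u in adj[v]):
                    nxt.add(v)
        else:
            # Push: frontier vertices scatter to neighbors, masked by "unvisited".
            for u in frontier:
                for v in adj[u]:
                    if depth[v] == -1:
                        nxt.add(v)
        for v in nxt:
            depth[v] = level
        frontier = nxt
    return depth


graph = [[1, 2], [0, 3], [0, 3], [1, 2, 4], [3]]
depths = bfs_push_pull(graph, 0)
```

Both directions produce the same depths; they differ only in how much work is done per level, which is why the switch can be treated as a separable optimization.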
Automatic Generation of Efficient Sparse Tensor Format Conversion Routines
This paper shows how to generate code that efficiently converts sparse
tensors between disparate storage formats (data layouts) such as CSR, DIA, ELL,
and many others. We decompose sparse tensor conversion into three logical
phases: coordinate remapping, analysis, and assembly. We then develop a
language that precisely describes how different formats group together and
order a tensor's nonzeros in memory. This lets a compiler emit code that
performs complex remappings of nonzeros when converting between formats. We
also develop a query language that can extract statistics about sparse tensors,
and we show how to emit efficient analysis code that computes such queries.
Finally, we define an abstract interface that captures how data structures for
storing a tensor can be efficiently assembled given specific statistics about
the tensor. Disparate formats can implement this common interface, thus letting
a compiler emit optimized sparse tensor conversion code for arbitrary
combinations of many formats without hard-coding for any specific combination.
Our evaluation shows that the technique generates sparse tensor conversion
routines with performance between 1.00× and 2.01× that of hand-optimized
versions in SPARSKIT and Intel MKL, two popular sparse linear algebra
libraries. And by emitting code that avoids materializing temporaries, which
both libraries need for many combinations of source and target formats, our
technique outperforms those libraries by 1.78× to 4.01× for CSC/COO to
DIA/ELL conversion.
Comment: Presented at PLDI 202
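The three phases named in the abstract (coordinate remapping, analysis, assembly) can be made concrete with a hand-written COO-to-DIA conversion. This is a hedged sketch, not code generated by the paper's compiler: the function name `coo_to_dia`, the layout `data[d][i] = A[i][i + offsets[d]]`, and the choice to sum duplicate coordinates are all assumptions made for illustration.

```python
def coo_to_dia(n_rows, rows, cols, vals):
    """Convert COO triples to a diagonal (DIA) layout in three phases.

    Returns (offsets, data) with data[d][i] = A[i][i + offsets[d]]; this
    layout is an assumption of the sketch, not a library convention.
    """
    # Phase 1, coordinate remapping: (i, j) -> (diagonal offset j - i, row i).
    remapped = [(j - i, i, v) for i, j, v in zip(rows, cols, vals)]

    # Phase 2, analysis: find which diagonals contain at least one nonzero.
    offsets = sorted({k for k, _, _ in remapped})
    slot = {k: d for d, k in enumerate(offsets)}

    # Phase 3, assembly: allocate one dense strip per diagonal and fill it.
    data = [[0.0] * n_rows for _ in offsets]
    for k, i, v in remapped:
        data[slot[k]][i] += v  # duplicates are summed (an assumption)
    return offsets, data


# 3x3 matrix with a full main diagonal plus entries at (0, 1) and (2, 1).
offsets, data = coo_to_dia(
    3,
    rows=[0, 1, 2, 0, 2],
    cols=[0, 1, 2, 1, 1],
    vals=[1.0, 2.0, 3.0, 4.0, 5.0],
)
```

The analysis phase is what lets the assembly phase allocate exactly the storage it needs before any value is written.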
Unveiling Relations in the Industry 4.0 Standards Landscape based on Knowledge Graph Embeddings
Industry 4.0 (I4.0) standards and standardization frameworks have been
proposed with the goal of empowering interoperability in smart
factories. These standards enable the description and interaction of the main
components, systems, and processes inside of a smart factory. Due to the
growing number of frameworks and standards, there is an increasing need for
approaches that automatically analyze the landscape of I4.0 standards.
Standardization frameworks classify standards according to their functions into
layers and dimensions. However, similar standards can be classified differently
across the frameworks, thus producing interoperability conflicts among them.
Semantic-based approaches that rely on ontologies and knowledge graphs have
been proposed to represent standards, known relations among them, as well as
their classification according to existing frameworks. Albeit informative, the
structured modeling of the I4.0 landscape only provides the foundations for
detecting interoperability issues. Thus, graph-based analytical methods able to
exploit the knowledge encoded by these approaches are required to uncover
alignments among standards. We study the relatedness among standards and
frameworks based on community analysis to discover knowledge that helps to cope
with interoperability conflicts between standards. We use knowledge graph
embeddings to automatically create these communities exploiting the meaning of
the existing relationships. In particular, we focus on the identification of
similar standards, i.e., communities of standards, and analyze their properties
to detect unknown relations. We empirically evaluate our approach on a
knowledge graph of I4.0 standards using the Trans* family of embedding
models for knowledge graph entities. Our results are promising and suggest that
relations among standards can be detected accurately.
Comment: 15 pages, 7 figures, DEXA2020 Conferenc
Recent Advances in Graph Partitioning
We survey recent trends in practical algorithms for balanced graph
partitioning together with applications and future research directions